In multi-task learning several related tasks are considered simultaneously,
with the hope that by an appropriate sharing of information across tasks, 
each task may benefit from the others. In the context of learning linear 
functions for supervised classification or regression, this can be achieved by 
including a priori information about the weight vectors associated with the 
tasks, and how they are expected to be related to each other. In this paper, we 
assume that tasks are clustered into groups, which are unknown beforehand, 
and that tasks within a group have similar weight vectors. We design a 
new spectral norm that encodes this a priori assumption, without the prior 
knowledge of the partition of tasks into groups, resulting in a new convex 
optimization formulation for multi-task learning. We show in simulations 
on synthetic examples and on the IEDB MHC-I binding dataset that our
approach outperforms well-known convex methods for multi-task learning, as
well as related non-convex methods dedicated to the same problem.
1 Introduction



Regularization has emerged as a dominant theme in machine learning and statistics, 
providing an intuitive and principled tool for learning from high-dimensional data. 
In particular, regularization by squared Euclidean norms or squared Hilbert norms 
has been thoroughly studied in various settings, leading to efficient practical
algorithms based on linear algebra, and to very good theoretical understanding (see,
e.g., [1, 2]). In recent years, regularization by non-Hilbert norms, such as $\ell_p$
norms with $p \neq 2$, has also generated considerable interest for the inference of
linear functions in supervised classification or regression. Indeed, such norms can
sometimes both make the problem statistically and numerically better-behaved, and
impose various a priori knowledge on the problem. For example, the $\ell_1$-norm (the
sum of absolute values) forces some of the components to be equal to zero and is
widely used to estimate sparse functions [3], while various combinations of $\ell_p$
norms can be defined to impose various sparsity patterns.

While most recent work has focused on studying the properties of simple well- 
known norms, we take the opposite approach in this paper. That is, assuming a 
given prior knowledge, how can we design a norm that will enforce it? 

More precisely, we consider the problem of multi-task learning, which has recently 
emerged as a very promising research direction for various applications [4]. In multi- 
task learning several related inference tasks are considered simultaneously, with the 
hope that by an appropriate sharing of information across tasks, each one may 
benefit from the others. When linear functions are estimated, each task is associated 
with a weight vector, and a common strategy to design multi-task learning algorithms
is to translate some prior hypothesis about how the tasks are related to each other 
into constraints on the different weight vectors. For example, such constraints are 
typically that the weight vectors of the different tasks belong (a) to a Euclidean ball 
centered at the origin [5], which implies no sharing of information between tasks 
apart from the size of the different vectors, i.e., the amount of regularization, (b) 
to a ball of unknown center [5], which enforces a similarity between the different 
weight vectors, or (c) to an unknown low-dimensional subspace [6, 7].

In this paper, we consider a different prior hypothesis that we believe could be 
more relevant in some applications: the hypothesis that the different tasks are in fact 
clustered into different groups, and that the weight vectors of tasks within a group 
are similar to each other. A key difference with [5], where a similar hypothesis is 
studied, is that we don't assume that the groups are known a priori, and in a sense 
our goal is both to identify the clusters and to use them for multi-task learning. 
An important situation that motivates this hypothesis is the case where most of the 
tasks are indeed related to each other, but a few "outlier" tasks are very different, in 
which case it may be better to impose similarity or low-dimensional constraints only
to a subset of the tasks (thus forming a cluster) rather than to all tasks. Another 
situation of interest is when one can expect a natural organization of the tasks into 
clusters, such as when one wants to model the preferences of customers and believes
that there are a few general types of customers with similar preferences within each 
type, although one does not know beforehand which customers belong to which 
types. Besides an improved performance if the hypothesis turns out to be correct, 
we also expect this approach to be able to identify the cluster structure among the 
tasks as a by-product of the inference step, e.g., to identify outliers or groups of 
customers, which can be of interest for further understanding of the structure of the 
problem. 

In order to translate this hypothesis into a working algorithm, we follow the 
general strategy mentioned above which is to design a norm or a penalty over the set 
of weights which can be used as regularization in classical inference algorithms. We 
construct such a penalty by first assuming that the partition of the tasks into clusters 
is known, similarly to [5]. We then attempt to optimize the objective function of 
the inference algorithm over the set of partitions, a strategy that has proved useful 
in other contexts such as multiple kernel learning [8]. This optimization problem 
over the set of partitions being computationally challenging, we propose a convex 
relaxation of the problem which results in an efficient algorithm. 



2 Multi-task learning with clustered tasks

We consider $m$ related inference tasks that attempt to learn linear functions over
$\mathcal{X} = \mathbb{R}^d$ from a training set of input/output pairs
$(x_i, y_i)_{i=1,\dots,n}$, where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$. In the
case of binary classification we usually take $\mathcal{Y} = \{-1, +1\}$, while in the
case of regression we take $\mathcal{Y} = \mathbb{R}$. Each training example $(x_i, y_i)$
is associated to a particular task $t \in [1, m]$, and we denote by
$\mathcal{I}(t) \subset [1, n]$ the set of indices of training examples associated to the
task $t$. Our goal is to infer $m$ linear functions $f_t(x) = w_t^\top x$, for
$t = 1, \dots, m$, associated to the different tasks. We denote by
$W = (w_1 \dots w_m)$ the $d \times m$ matrix whose columns are the successive vectors
we want to estimate.

We fix a loss function $l : \mathbb{R} \times \mathcal{Y} \mapsto \mathbb{R}$ that
quantifies by $l(f(x), y)$ the cost of predicting $f(x)$ for the input $x$ when the
correct output is $y$. Typical loss functions include the square error in regression
$l(u, y) = \frac{1}{2}(u - y)^2$ or the hinge loss in binary classification
$l(u, y) = \max(0, 1 - uy)$ with $y \in \{-1, 1\}$. The empirical risk of a set of
linear classifiers given in the matrix $W$ is then defined as the average loss over
the training set:

$$\ell(W) = \frac{1}{n} \sum_{t=1}^{m} \sum_{i \in \mathcal{I}(t)} l(w_t^\top x_i, y_i). \tag{1}$$

In the sequel, we will often use the $m \times 1$ vector $\mathbf{1}$ composed of ones,
the $m \times m$ projection matrix $U = \mathbf{1}\mathbf{1}^\top / m$ whose entries are
all equal to $1/m$, as well as the projection matrix $\Pi = I - U$.
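As an illustration of this notation, the empirical risk (1) and the matrices $U$ and $\Pi$ can be sketched in a few lines of plain Python. The data layout (one `(x, y, t)` triple per training example, with a zero-based task index `t`) and all function names are assumptions made for this example only.

```python
def square_loss(u, y):
    """Square error l(u, y) = (u - y)^2 / 2."""
    return 0.5 * (u - y) ** 2


def empirical_risk(W, examples, loss=square_loss):
    """Average loss over the training set.

    W        : list of m weight vectors (the columns w_t of the d x m matrix).
    examples : list of (x, y, t) triples; t is the zero-based task index
               (this layout is an assumption of the sketch).
    """
    total = 0.0
    for x, y, t in examples:
        prediction = sum(w_k * x_k for w_k, x_k in zip(W[t], x))
        total += loss(prediction, y)
    return total / len(examples)


def projection_matrices(m):
    """Return U = 1 1^T / m and Pi = I - U as nested lists."""
    U = [[1.0 / m] * m for _ in range(m)]
    Pi = [[(1.0 if i == j else 0.0) - 1.0 / m for j in range(m)]
          for i in range(m)]
    return U, Pi


# Two tasks in dimension d = 2, three training examples (toy values).
W = [[1.0, 0.0], [0.0, 2.0]]
examples = [([1.0, 1.0], 1.0, 0), ([2.0, 0.0], 1.0, 0), ([0.0, 1.0], 1.0, 1)]
risk = empirical_risk(W, examples)
U, Pi = projection_matrices(2)
```

Each row of `Pi` sums to zero, which is the sense in which multiplying by $\Pi$ centers a matrix.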
In order to learn simultaneously the $m$ tasks, we follow the now well-established
approach which looks for a set of weight vectors $W$ that minimizes the empirical
risk regularized by a penalty functional, i.e., we consider the problem:

$$\min_{W \in \mathbb{R}^{d \times m}} \ell(W) + \lambda \Omega(W), \tag{2}$$



where $\Omega(W)$ can be designed from prior knowledge to constrain some sharing of
information between tasks. For example, [5] suggests to penalize both the norms of
the $w_i$'s and their variance, i.e., to consider a function of the form:

$$\Omega_{variance}(W) = \|\bar{w}\|^2 + \frac{\beta}{m} \sum_{i=1}^{m} \|w_i - \bar{w}\|^2, \tag{3}$$

where $\bar{w} = (\sum_{i=1}^{m} w_i)/m$ is the mean weight vector. This penalty enforces a
clustering of the $w_i$'s towards their mean when $\beta$ increases. Alternatively, [7]
propose to penalize the trace norm of $W$:

$$\Omega_{trace}(W) = \sum_{i=1}^{\min(d,m)} \sigma_i(W), \tag{4}$$

where $\sigma_1(W), \dots, \sigma_{\min(d,m)}(W)$ are the successive singular values of $W$.
This enforces a low-rank solution in $W$, i.e., constrains the different $w_i$'s to live
in a low-dimensional subspace.

Here we would like to define a penalty function $\Omega(W)$ that encodes as prior
knowledge that tasks are clustered into $r < m$ groups. To do so, let us first assume
that we know beforehand the clusters, i.e., we have a partition of the set of tasks
into $r$ groups. In that case we can follow an approach proposed by [5] which for
clarity we rephrase with our notations and slightly generalize now. For a given
cluster $c \in [1, r]$, let us denote by $\mathcal{J}(c) \subset [1, m]$ the set of tasks in
$c$, by $m_c = |\mathcal{J}(c)|$ the number of tasks in the cluster $c$, and by $E$ the
$m \times r$ binary matrix which describes the cluster assignment for the $m$ tasks, i.e.,
$E_{ij} = 1$ if task $i$ is in cluster $j$, and $0$ otherwise. Let us further denote by
$\bar{w}_c = (\sum_{i \in \mathcal{J}(c)} w_i)/m_c$ the average weight vector for the tasks
in $c$, and recall that $\bar{w} = (\sum_{i=1}^{m} w_i)/m$ denotes the average weight vector
over all tasks. Finally it will be convenient to introduce the matrix
$M = E(E^\top E)^{-1}E^\top$. $M$ can also be written $I - L$, where $L$ is the normalized
Laplacian of the graph $G$ whose nodes are the tasks connected by an edge if and only
if they are in the same cluster. Then we can define three semi-norms of interest on
$W$ that quantify different orthogonal aspects:

• A global penalty, which measures on average how large the weight vectors are:

$$\Omega_{mean}(W) = m \|\bar{w}\|^2 = \mathrm{tr}\, W U W^\top.$$

• A measure of between-cluster variance, which quantifies how close to each
other the different clusters are:

$$\Omega_{between}(W) = \sum_{c=1}^{r} m_c \|\bar{w}_c - \bar{w}\|^2 = \mathrm{tr}\, W (M - U) W^\top.$$

• A measure of within-cluster variance, which quantifies the compactness of the
different clusters:

$$\Omega_{within}(W) = \sum_{c=1}^{r} \sum_{i \in \mathcal{J}(c)} \|w_i - \bar{w}_c\|^2 = \mathrm{tr}\, W (I - M) W^\top.$$
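A small numerical check can make the three semi-norms concrete. The sketch below computes them from their sum-of-squares definitions, taking $\Omega_{mean}(W) = m\|\bar{w}\|^2$, $\Omega_{between}(W) = \sum_c m_c \|\bar{w}_c - \bar{w}\|^2$ and $\Omega_{within}(W) = \sum_c \sum_{i \in \mathcal{J}(c)} \|w_i - \bar{w}_c\|^2$, and confirms on a toy example that they add up to the squared Frobenius norm of $W$. The function names and the toy weight vectors are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((a_k - b_k) ** 2 for a_k, b_k in zip(a, b))


def cluster_penalties(W, clusters):
    """Return (Omega_mean, Omega_between, Omega_within).

    W        : list of m weight vectors w_i.
    clusters : list of lists of task indices forming a partition of range(m).
    """
    m = len(W)
    w_bar = mean_vector(W)
    zero = [0.0] * len(w_bar)
    omega_mean = m * squared_distance(w_bar, zero)
    omega_between = 0.0
    omega_within = 0.0
    for tasks in clusters:
        w_bar_c = mean_vector([W[i] for i in tasks])
        omega_between += len(tasks) * squared_distance(w_bar_c, w_bar)
        omega_within += sum(squared_distance(W[i], w_bar_c) for i in tasks)
    return omega_mean, omega_between, omega_within


# Four tasks in dimension 2, grouped into two clusters (toy values).
W = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
clusters = [[0, 1], [2, 3]]
omega_mean, omega_between, omega_within = cluster_penalties(W, clusters)
frobenius_squared = sum(w_k ** 2 for w in W for w_k in w)
```

On this example the three terms are 13, 13 and 4, and their sum equals the squared Frobenius norm, 30, whatever partition is chosen.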

We note that both $\Omega_{between}(W)$ and $\Omega_{within}(W)$ depend on the particular
choice of clusters $E$, or equivalently of $M$. We now propose to consider the
following general penalty function:

$$\Omega(W) = \varepsilon_M \Omega_{mean}(W) + \varepsilon_B \Omega_{between}(W) + \varepsilon_W \Omega_{within}(W), \tag{5}$$

where $\varepsilon_M$, $\varepsilon_B$ and $\varepsilon_W$ are three non-negative parameters that
can balance the importance of the different components of the penalty. Plugging this
quadratic penalty into (2) leads to the general optimization problem:

$$\min_{W \in \mathbb{R}^{d \times m}} \ell(W) + \lambda\, \mathrm{tr}\, W \Sigma(M)^{-1} W^\top, \tag{6}$$

where

$$\Sigma(M)^{-1} = \varepsilon_M U + \varepsilon_B (M - U) + \varepsilon_W (I - M). \tag{7}$$

Here we use the notation $\Sigma(M)$ to insist on the fact that this quadratic penalty
depends on the cluster structure through the matrix $M$. Observing that the matrices
$U$, $M - U$ and $I - M$ are orthogonal projections onto orthogonal supplementary
subspaces, we easily get from (7):

$$\Sigma(M) = \varepsilon_M^{-1} U + \varepsilon_B^{-1} (M - U) + \varepsilon_W^{-1} (I - M) = \varepsilon_W^{-1} I + (\varepsilon_M^{-1} - \varepsilon_B^{-1}) U + (\varepsilon_B^{-1} - \varepsilon_W^{-1}) M. \tag{8}$$
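The inversion that leads from (7) to (8) can be checked numerically on a toy clustering. The following sketch builds $M = E(E^\top E)^{-1}E^\top$ entry by entry (entry $(i, j)$ equals $1/m_c$ when tasks $i$ and $j$ share cluster $c$, and $0$ otherwise), forms both matrices, and multiplies them. The helper names, the partition and the parameter values are illustrative assumptions.

```python
def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def cluster_projection(m, clusters):
    """M = E (E^T E)^{-1} E^T for a partition given as lists of task indices."""
    M = [[0.0] * m for _ in range(m)]
    for tasks in clusters:
        for i in tasks:
            for j in tasks:
                M[i][j] = 1.0 / len(tasks)
    return M


def combine(m, M, c_mean, c_between, c_within):
    """Return c_mean * U + c_between * (M - U) + c_within * (I - M)."""
    u = 1.0 / m
    return [[c_mean * u
             + c_between * (M[i][j] - u)
             + c_within * ((1.0 if i == j else 0.0) - M[i][j])
             for j in range(m)] for i in range(m)]


m = 4
M = cluster_projection(m, [[0, 1], [2, 3]])       # hypothetical partition
eps_M, eps_B, eps_W = 0.5, 2.0, 4.0               # arbitrary positive values
sigma_inv = combine(m, M, eps_M, eps_B, eps_W)    # equation (7)
sigma = combine(m, M, 1 / eps_M, 1 / eps_B, 1 / eps_W)  # equation (8)
product = matmul(sigma_inv, sigma)                # should be the identity
```

The product is the identity because $U$, $M - U$ and $I - M$ are mutually orthogonal projections that sum to $I$, so inverting (7) amounts to inverting each coefficient.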

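By choosing particular values for $\varepsilon_M$, $\varepsilon_B$ and $\varepsilon_W$ we can
recover several situations. In particular:

• For $\varepsilon_W = \varepsilon_B = \varepsilon_M = \varepsilon$, we simply recover the Frobenius
norm of $W$, which does not put any constraint on the relationship between the
different tasks:

$$\Omega(W) = \varepsilon\, \mathrm{tr}\, W W^\top = \varepsilon \sum_{i=1}^{m} \|w_i\|^2.$$

• For $\varepsilon_W = \varepsilon_B > \varepsilon_M$, we recover the penalty of [5] without clusters:

$$\Omega(W) = \mathrm{tr}\, W (\varepsilon_M U + \varepsilon_B (I - U)) W^\top = \varepsilon_M m \|\bar{w}\|^2 + \varepsilon_B \sum_{i=1}^{m} \|w_i - \bar{w}\|^2.$$

In that case, a global similarity between tasks is enforced, in addition to the
general constraint on their mean. The structure in clusters plays no role since
the sum of the between- and within-cluster variance is independent of the
particular choice of clusters.

• For $\varepsilon_W > \varepsilon_B = \varepsilon_M$, we recover the penalty of [5] with clusters:

$$\Omega(W) = \mathrm{tr}\, W (\varepsilon_M M + \varepsilon_W (I - M)) W^\top = \varepsilon_M \sum_{c=1}^{r} \Big\{ m_c \|\bar{w}_c\|^2 + \frac{\varepsilon_W}{\varepsilon_M} \sum_{i \in \mathcal{J}(c)} \|w_i - \bar{w}_c\|^2 \Big\}. \tag{9}$$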

In order to enforce a cluster hypothesis on the tasks, we therefore see that a natural
choice is to take $\varepsilon_W > \varepsilon_B > \varepsilon_M$ in (5). This would have the effect of
penalizing more the within-cluster variance than the between-cluster variance, hence
promoting compact clusters. Of course, a major limitation at this point is that we
assumed the cluster structure known a priori (through the matrix $E$, or equivalently
$M$). In many cases of interest, we would like instead to learn the cluster structure
itself from the data. We propose to learn the cluster structure in our framework by
optimizing our objective function (6) both in $W$ and $M$, i.e., to consider the problem:

$$\min_{W \in \mathbb{R}^{d \times m},\, M \in \mathcal{M}_r} \ell(W) + \lambda\, \mathrm{tr}\, W \Sigma(M)^{-1} W^\top, \tag{10}$$

where $\mathcal{M}_r$ denotes the set of matrices $M = E(E^\top E)^{-1}E^\top$ defined by a
clustering of the $m$ tasks into $r$ clusters and $\Sigma(M)$ is defined in (8). Denoting by
$\mathcal{S}_r = \{\Sigma(M) : M \in \mathcal{M}_r\}$ the corresponding set of positive
semidefinite matrices, we can equivalently rewrite the problem as:

$$\min_{W \in \mathbb{R}^{d \times m},\, \Sigma \in \mathcal{S}_r} \ell(W) + \lambda\, \mathrm{tr}\, W \Sigma^{-1} W^\top. \tag{11}$$

The objective function in (11) is jointly convex in $W \in \mathbb{R}^{d \times m}$ and
$\Sigma \in \mathcal{S}_+^m$, the set of $m \times m$ positive semidefinite matrices; however,
the (finite) set $\mathcal{S}_r$ is not convex, making this problem intractable. We are
now going to propose a convex relaxation of (11) by optimizing over a convex set of
positive semidefinite matrices that contains $\mathcal{S}_r$.
3 Convex relaxation



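In order to formulate a convex relaxation of (11), let us first observe that in the
penalty term (5) the cluster structure only contributes to the second and third terms
$\Omega_{between}(W)$ and $\Omega_{within}(W)$, and that these penalties only depend on the
centered version of $W$. In terms of matrices, only the last two terms of
$\Sigma(M)^{-1}$ in (7) depend on $M$, i.e., on the clustering, and these terms can be
rewritten as:

$$\varepsilon_B (M - U) + \varepsilon_W (I - M) = \Pi (\varepsilon_B M + \varepsilon_W (I - M)) \Pi. \tag{12}$$

Indeed, it is easy to check that $M - U = M\Pi = \Pi M \Pi$, and that
$I - M = I - U - (M - U) = \Pi - \Pi M \Pi = \Pi (I - M) \Pi$. Intuitively, multiplying by
$\Pi$ on the right (resp. on the left) centers the rows (resp. the columns) of a matrix,
and both $M - U$ and $I - M$ are row- and column-centered.

To simplify notations, let us introduce $\tilde{M} = \Pi M \Pi$. Plugging (12) in (7) and
(10), we get the penalty

$$\mathrm{tr}\, W \Sigma(M)^{-1} W^\top = \varepsilon_M\, \mathrm{tr}(W^\top W U) + \mathrm{tr}\, (W\Pi)(\varepsilon_B \tilde{M} + \varepsilon_W (I - \tilde{M}))(W\Pi)^\top, \tag{13}$$

in which, again, only the second part needs to be optimized with respect to the
clustering $M$. Denoting $\Sigma_c^{-1}(\tilde{M}) = \varepsilon_B \tilde{M} + \varepsilon_W (I - \tilde{M})$,
one can express $\Sigma_c(\tilde{M})$, using the fact that $\tilde{M}$ is a projection:

$$\Sigma_c(\tilde{M}) = (\varepsilon_B^{-1} - \varepsilon_W^{-1}) \tilde{M} + \varepsilon_W^{-1} I. \tag{14}$$

$\Sigma_c$ is characterized by $\tilde{M} = \Pi M \Pi$, that is discrete by construction, hence
the non-convexity of $\mathcal{S}_r$. We have the natural constraints $M \geq 0$ (i.e.,
$\tilde{M} \geq -U$), $0 \preceq M \preceq I$ (i.e., $0 \preceq \tilde{M} \preceq \Pi$) and
$\mathrm{tr}\, M = r$ (i.e., $\mathrm{tr}\, \tilde{M} = r - 1$). A possible convex relaxation of
the discrete set of matrices $\tilde{M}$ is therefore
$\{\tilde{M} : 0 \preceq \tilde{M} \preceq I,\ \mathrm{tr}\, \tilde{M} = r - 1\}$. This gives an
equivalent convex set $\mathcal{S}_c$ for $\Sigma_c$, namely:

$$\mathcal{S}_c = \{\Sigma_c \in \mathcal{S}_+^m : \alpha I \preceq \Sigma_c \preceq \beta I,\ \mathrm{tr}\, \Sigma_c = \gamma\}, \tag{15}$$

with $\alpha = \varepsilon_W^{-1}$, $\beta = \varepsilon_B^{-1}$ and
$\gamma = (m - r + 1)\varepsilon_W^{-1} + (r - 1)\varepsilon_B^{-1}$. Incorporating the first part
of the penalty (13) into the empirical risk term by defining
$\ell_c(W) = \ell(W) + \lambda \varepsilon_M\, \mathrm{tr}(W^\top W U)$, we are now ready to state
our relaxation of (11):

$$\min_{W \in \mathbb{R}^{d \times m},\, \Sigma_c \in \mathcal{S}_c} \ell_c(W) + \lambda\, \mathrm{tr}\, W\Pi\, \Sigma_c^{-1} (W\Pi)^\top. \tag{16}$$
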
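To see why the relaxed set contains the matrices produced by an actual clustering, one can verify on a toy partition that $\tilde{M} = \Pi M \Pi$ is a projection of trace $r - 1$, so that $\Sigma_c(\tilde{M})$ from (14) has eigenvalues in $\{\alpha, \beta\}$ and trace $\gamma$. The sketch below is only a numerical sanity check; the partition, the parameter values and the helper names are assumptions.

```python
def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def trace(A):
    return sum(A[i][i] for i in range(len(A)))


m = 5
clusters = [[0, 1], [2, 3, 4]]           # hypothetical partition of the tasks
r = len(clusters)
eps_B, eps_W = 2.0, 4.0                  # arbitrary values with eps_W > eps_B
alpha, beta = 1.0 / eps_W, 1.0 / eps_B
gamma = (m - r + 1) * alpha + (r - 1) * beta

label = {i: c for c, tasks in enumerate(clusters) for i in tasks}
M = [[1.0 / len(clusters[label[i]]) if label[i] == label[j] else 0.0
      for j in range(m)] for i in range(m)]
Pi = [[(1.0 if i == j else 0.0) - 1.0 / m for j in range(m)]
      for i in range(m)]

# Centered clustering matrix and a check that it is idempotent.
M_tilde = matmul(matmul(Pi, M), Pi)
squared = matmul(M_tilde, M_tilde)
projection_error = max(abs(squared[i][j] - M_tilde[i][j])
                       for i in range(m) for j in range(m))

# Equation (14): Sigma_c = (1/eps_B - 1/eps_W) * M_tilde + (1/eps_W) * I
sigma_c = [[(beta - alpha) * M_tilde[i][j] + (alpha if i == j else 0.0)
            for j in range(m)] for i in range(m)]
```

Since a projection has eigenvalues 0 or 1, idempotence of `M_tilde` together with the two trace identities is enough to place `sigma_c` in the set (15).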
3.1 Reinterpretation in terms of norms 

We denote by $\|W\|_c^2 = \min_{\Sigma_c \in \mathcal{S}_c} \mathrm{tr}\, W \Sigma_c^{-1} W^\top$ the
cluster norm (CN). For any convex set $\mathcal{S}_c$, we obtain a norm on $W$ (that we
apply here to its centered version). By putting some different constraints on the set
$\mathcal{S}_c$, we obtain different norms on $W$, and in fact all previous multi-task
formulations may be cast in this way, i.e., by choosing a specific set of positive
matrices $\mathcal{S}_c$ (e.g., trace constraint for the trace norm, and simply a
singleton for the Frobenius norm). Thus, designing norms for multi-task learning is
equivalent to designing a set of positive matrices. In this paper, we have
investigated a specific set adapted for clustered tasks, but other sets could be
designed in other situations.

Note that we have selected a simple spectral convex set $\mathcal{S}_c$ in order to make
the optimization simpler in Section 3.3, but we could also add some additional
constraints that encode the point-wise positivity of the matrix $M$. Finally, when
$r = 1$ (one cluster) and $r = m$ (one cluster per task), we get back the formulation
of [5].

3.2 Reinterpretation as a convex relaxation of K-means

In this section we show that the semi-norm $\|W\Pi\|_c^2$ that we have designed earlier
can be interpreted as a convex relaxation of K-means on the tasks [9]. Indeed, given
$W \in \mathbb{R}^{d \times m}$, K-means aims to decompose it in the form $W = \mu E^\top$
where $\mu \in \mathbb{R}^{d \times r}$ are cluster centers and $E$ represents a partition.
Given the partition $E$, the matrix $\mu$ is found by minimizing
$\min_\mu \|W^\top - E\mu^\top\|_F^2$. Thus, a natural strategy outlined by [9] is to
alternate between optimizing $\mu$, the partition $E$ and the weight vectors $W$. We now
show that our convex norm is obtained when minimizing in closed form with respect to
$\mu$ and relaxing.

By translation invariance, this is equivalent to minimizing
$\min_\mu \|\Pi W^\top - \Pi E \mu^\top\|_F^2$. If we add a penalization on $\mu$ of the form
$\lambda\, \mathrm{tr}\, E^\top E \mu^\top \mu$, then a short calculation shows that the minimum
with respect to $\mu$ (i.e., after optimization of the cluster centers) is equal to

$$\mathrm{tr}\, \Pi W^\top W \Pi (\Pi E (E^\top E)^{-1} E^\top \Pi / \lambda + I)^{-1} = \mathrm{tr}\, \Pi W^\top W \Pi (\Pi M \Pi / \lambda + I)^{-1}.$$

By comparing with Eq. (14), we see that our formulation is indeed a convex relaxation
of K-means.

3.3 Primal optimization

Let us now show in more detail how (16) can be solved efficiently. Whereas a dual
formulation could be easily derived following [8], a direct approach is to rewrite (16)
as

$$\min_{W \in \mathbb{R}^{d \times m}} \Big( \ell_c(W) + \lambda \min_{\Sigma_c \in \mathcal{S}_c} \mathrm{tr}\, W\Pi\, \Sigma_c^{-1} (W\Pi)^\top \Big), \tag{17}$$

which, if $\ell_c$ is differentiable, can be directly optimized by gradient-based methods
on $W$ since $\|W\Pi\|_c^2 = \min_{\Sigma_c \in \mathcal{S}_c} \mathrm{tr}\, W\Pi\, \Sigma_c^{-1} (W\Pi)^\top$
is a quadratic semi-norm of $W$. This
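regularization term $\mathrm{tr}\, W\Pi\, \Sigma_c^{-1} (W\Pi)^\top$ and its gradient can be
computed efficiently using a semi-closed form. Indeed, since $\mathcal{S}_c$ as defined in
(15) is a spectral set (i.e., it depends only on eigenvalues of covariance matrices),
we obtain a function of the singular values of $W\Pi$ (or equivalently the eigenvalues
of $\Pi W^\top W \Pi$):

$$\min_{\Sigma_c \in \mathcal{S}_c} \mathrm{tr}\, W\Pi\, \Sigma_c^{-1} (W\Pi)^\top = \min_{\lambda \in \mathbb{R}^m,\ \alpha \leq \lambda_i \leq \beta,\ \lambda^\top \mathbf{1} = \gamma,\ V \in \mathcal{O}^m} \mathrm{tr}\, W\Pi\, V \mathrm{diag}(\lambda)^{-1} V^\top (W\Pi)^\top,$$

where $\mathcal{O}^m$ is the set of orthogonal matrices in $\mathbb{R}^{m \times m}$. The optimal
$V$ is the matrix of the eigenvectors of $\Pi W^\top W \Pi$, and we obtain the value of the
objective function at the optimum:

$$\min_{\Sigma_c \in \mathcal{S}_c} \mathrm{tr}\, W\Pi\, \Sigma_c^{-1} (W\Pi)^\top = \min_{\lambda \in \mathbb{R}^m,\ \alpha \leq \lambda_i \leq \beta,\ \lambda^\top \mathbf{1} = \gamma} \sum_{i=1}^{m} \frac{\sigma_i^2}{\lambda_i},$$

where $\sigma$ and $\lambda$ are the vectors containing the singular values of $W\Pi$ and
$\Sigma_c$ respectively. Now, we simply need to be able to compute this function of the
singular values.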

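The only coupling in this formulation comes from the trace constraint. The
Lagrangian corresponding to this constraint is:

$$\mathcal{L}(\lambda, \nu) = \sum_{i=1}^{m} \frac{\sigma_i^2}{\lambda_i} + \nu \Big( \sum_{i=1}^{m} \lambda_i - \gamma \Big). \tag{18}$$

For $\nu \leq 0$, this is a decreasing function of $\lambda_i$, so the minimum on
$\lambda_i \in [\alpha, \beta]$ is reached for $\lambda_i = \beta$. The dual function is then a
linear non-decreasing function of $\nu$ (since $\alpha \leq \gamma/m \leq \beta$ from the
definition of $\alpha$, $\beta$, $\gamma$ in (15)), which reaches its maximum value (on
$\nu \leq 0$) at $\nu = 0$. Let us therefore now consider the dual for $\nu \geq 0$. (18)
is then a convex function of $\lambda_i$. Canceling its derivative with respect to
$\lambda_i$ gives that the minimum in $\lambda \in \mathbb{R}$ is reached for
$\lambda_i = \sigma_i/\sqrt{\nu}$. Now this may not be in the constraint set
$(\alpha, \beta)$, so if $\sigma_i < \alpha\sqrt{\nu}$ then the minimum in
$\lambda_i \in [\alpha, \beta]$ of (18) is reached for $\lambda_i = \alpha$, and if
$\sigma_i > \beta\sqrt{\nu}$ it is reached for $\lambda_i = \beta$. Otherwise, it is
reached for $\lambda_i = \sigma_i/\sqrt{\nu}$. Reporting this in (18), the dual problem is
therefore

$$\max_{\nu \geq 0} \sum_{i:\ \alpha\sqrt{\nu} \leq \sigma_i \leq \beta\sqrt{\nu}} 2\sigma_i\sqrt{\nu} + \sum_{i:\ \sigma_i < \alpha\sqrt{\nu}} \Big( \frac{\sigma_i^2}{\alpha} + \nu\alpha \Big) + \sum_{i:\ \beta\sqrt{\nu} < \sigma_i} \Big( \frac{\sigma_i^2}{\beta} + \nu\beta \Big) - \nu\gamma. \tag{19}$$

Since a closed form for this expression is known for each fixed value of $\nu$, one
can obtain $\|W\Pi\|_c^2$ (and the eigenvalues of $\Sigma^*$) by Algorithm 1. The cancellation
condition in Algorithm 1 is that the value canceling the derivative belongs to $(a, b)$,
i.e.,

$$\nu = \Big( \frac{\sum_{i:\ \alpha\sqrt{\nu} \leq \sigma_i \leq \beta\sqrt{\nu}} \sigma_i}{\gamma - (\alpha n^- + \beta n^+)} \Big)^2 \in (a, b),$$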
Algorithm 1 Computing $\|A\|_c^2$

Require: $A$, $\alpha$, $\beta$, $\gamma$.
Ensure: $\|A\|_c^2$, $\lambda^*$.

Compute the singular values $\sigma_i$ of $A$.
Order the $\sigma_i^2/\alpha^2$, $\sigma_i^2/\beta^2$ in a vector $I$ (with an additional $0$ at the beginning).
for all intervals $(a, b)$ of $I$ do
    if $\partial \mathcal{L}(\lambda^*, \nu)/\partial \nu$ is canceled on $\nu \in (a, b)$ then
        Replace $\nu^*$ in the dual function $\mathcal{L}(\lambda^*, \nu)$ to get $\|A\|_c^2$, compute $\lambda^*$ on $(a, b)$.
        return $\|A\|_c^2$, $\lambda^*$.
    end if
end for



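where $n^-$ and $n^+$ are the number of $\sigma_i < \alpha\sqrt{\nu}$ and
$\sigma_i > \beta\sqrt{\nu}$ respectively. In order to perform the gradient descent, we
also need to compute $\partial \|W\Pi\|_c^2 / \partial W$. This can be computed directly
using $\lambda^*$, by:

$$\frac{\partial \|W\Pi\|_c^2}{\partial \sigma_i} = \frac{2\sigma_i}{\lambda_i^*} \quad \text{and} \quad \frac{\partial \|W\Pi\|_c^2}{\partial W} = \frac{\partial \|W\Pi\|_c^2}{\partial (W\Pi)}\, \Pi.$$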
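The one-dimensional dual suggests a simple way to evaluate the penalty from the singular values alone. The sketch below is not Algorithm 1: instead of scanning the intervals between breakpoints, it locates $\sqrt{\nu}$ by bisection, relying on the fact that $\sum_i \lambda_i(\nu)$ with $\lambda_i(\nu) = \sigma_i/\sqrt{\nu}$ clipped to $[\alpha, \beta]$ is non-increasing in $\nu$. It assumes $0 < \alpha < \beta$ and $m\alpha \leq \gamma \leq m\beta$. Zero singular values do not affect the objective, so the sketch lets them absorb any trace that the positive ones cannot reach; this tie-breaking rule, the function name and the iteration count are assumptions of the example.

```python
def cluster_norm_from_singular_values(sigma, alpha, beta, gamma, iterations=100):
    """Minimize sum_i sigma_i^2 / lambda_i subject to
    alpha <= lambda_i <= beta and sum_i lambda_i = gamma.

    Returns (value, lambdas). Bisection on sqrt(nu) replaces the interval
    scan of Algorithm 1 (a substitution made for brevity in this sketch).
    """
    m = len(sigma)
    positive = [s for s in sigma if s > 0.0]
    n_zero = m - len(positive)

    def eigenvalues(root_nu):
        # lambda_i = sigma_i / sqrt(nu), clipped to [alpha, beta]
        return [min(beta, max(alpha, s / root_nu)) for s in sigma]

    if not positive or beta * len(positive) + alpha * n_zero <= gamma:
        # Every positive sigma_i sits at beta; the remaining trace is spread
        # over the zero singular values, which cost nothing in the objective.
        lam_zero = (gamma - beta * len(positive)) / n_zero if n_zero else alpha
        lam = [beta if s > 0.0 else lam_zero for s in sigma]
    else:
        lo = min(positive) / beta    # here the trace is at least gamma
        hi = max(positive) / alpha   # here every lambda_i equals alpha
        for _ in range(iterations):
            mid = 0.5 * (lo + hi)
            if sum(eigenvalues(mid)) > gamma:
                lo = mid             # trace too large: increase sqrt(nu)
            else:
                hi = mid
        lam = eigenvalues(0.5 * (lo + hi))

    value = sum(s * s / l for s, l in zip(sigma, lam))
    return value, lam


# Toy case: m = 3, r = 2, alpha = 1, beta = 4, so gamma = 2 * 1 + 1 * 4 = 6.
value, lam = cluster_norm_from_singular_values([3.0, 1.0, 0.0], 1.0, 4.0, 6.0)
```

On this toy case the stationarity condition gives $\sqrt{\nu} = (3 + 1)/(6 - 1) = 0.8$, hence $\lambda^* = (3.75, 1.25, 1)$ and a value of $3.2$, which also matches the dual objective (19) evaluated at $\nu = 0.64$. The derivative with respect to each singular value is then $2\sigma_i/\lambda_i^*$.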



4 Experiments 
4.1 Artificial data 

We generated synthetic data consisting of two clusters of two tasks. The tasks are
vectors of $\mathbb{R}^d$, $d = 30$. For each cluster, a center $\bar{w}_c$ was generated in
$\mathbb{R}^{d-2}$, so that the two clusters be orthogonal. More precisely, each
$\bar{w}_c$ had $(d - 2)/2$ random features randomly drawn from
$\mathcal{N}(0, \sigma_r^2)$, $\sigma_r^2 = 900$, and $(d - 2)/2$ zero features.
Then, each task $t$ was computed as $w_t + \bar{w}_{c(t)}$, where $c(t)$ was the cluster
of $t$. $w_t$ had the same zero features as its cluster center, and the other features
were drawn from $\mathcal{N}(0, \sigma_c^2)$, $\sigma_c^2 = 16$. The last two features were
non-zero for all the tasks and drawn from $\mathcal{N}(0, \sigma_c^2)$. For each task, 2000
points were generated and a normal noise of variance $\sigma_n^2 = 150$ was added.
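The construction of the task weight vectors can be sketched as follows. Several details are assumptions made for the example and are not specified in the text: which coordinates carry each cluster's non-zero features (here the first and second halves of the first $d - 2$ coordinates), the variance used for the last two shared features (taken equal to the within-cluster variance 16), the random seed, and the function name. Input points and the output noise are left out.

```python
import random


def make_task_weights(d=30, var_center=900.0, var_task=16.0, seed=0):
    """Generate 2 clusters of 2 tasks with orthogonal cluster centers.

    Returns (weights, labels): four weight vectors of length d and the
    cluster index of each task. The support layout is an assumption.
    """
    rng = random.Random(seed)
    half = (d - 2) // 2
    supports = [range(0, half), range(half, 2 * half)]  # disjoint supports
    weights, labels = [], []
    for c, support in enumerate(supports):
        # Cluster center: non-zero only on this cluster's support.
        center = [0.0] * d
        for k in support:
            center[k] = rng.gauss(0.0, var_center ** 0.5)
        for _ in range(2):  # two tasks per cluster
            w = list(center)
            for k in support:
                # Task-specific perturbation on the same support.
                w[k] += rng.gauss(0.0, var_task ** 0.5)
            for k in (d - 2, d - 1):
                # Last two features are non-zero for every task.
                w[k] = rng.gauss(0.0, var_task ** 0.5)
            weights.append(w)
            labels.append(c)
    return weights, labels


weights, labels = make_task_weights()
```

Note that `random.gauss` takes a standard deviation, hence the square roots of the variances. With disjoint supports, the two cluster centers are orthogonal by construction.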

In a first experiment, we compared our cluster norm $\|.\|_c^2$ with the single-task
learning given by the Frobenius norm, and with the trace norm, which corresponds to
the assumption that the tasks live in a low-dimensional space. The multi-task kernel
approach being a special case of CN, its performance will always be between the
performance of the single task and the performance of CN.

In a second setting, we compare CN to alternative methods that differ in the
way they learn $\Sigma$:

• The True metric approach, that simply plugs the actual clustering in $E$ and
optimizes $W$ using this fixed metric. This requires knowing the true clustering
a priori, and can be thought of as a gold standard.

• The k-means approach, that alternates between optimizing the tasks in $W$
given the metric $\Sigma$ and re-learning $\Sigma$ by clustering the tasks $w_i$ [9]. The
clustering is done by a k-means run 3 times. This is a non-convex approach,
and different initializations of k-means may result in different local minima.

We also tried one run of CN followed by a run of True metric using the learned $\Sigma$
reprojected in $\mathcal{S}_r$ by rounding, i.e., by performing k-means on the eigenvectors
of the learned $\Sigma$ (Reprojected approach), and a run of k-means starting from the
relaxed solution (CNinit approach).

Only the first method requires knowing the true clustering a priori; all the other
methods can be run without any knowledge of the clustering structure of the tasks.

Each method was run with different numbers of training points. The training
points were equally separated between the two clusters and for each cluster, 5/6th of
the points were used for the first task and 1/6th for the second, in order to simulate
a natural setting where some tasks have fewer data. We used the 2000 points of each
task to build 3 training folds, and the remaining points were used for testing. We
used the mean RMSE across the tasks as a criterion, and a quadratic loss for $\ell(W)$.

The results of the first experiment are shown in Figure 1 (left). As expected,
both multi-task approaches perform better than the approach that learns each task 
independently. CN penalization on the other hand always gives better testing error 
than the trace norm penalization, with a stronger advantage when very few training 
points are available. When more training points become available, all the methods 
give more and more similar performances. In particular, with large samples, it is 
not useful anymore to use a multi-task approach. 

Figure 1 (right) shows the results of the second experiment. Using the true
metric always gives the best results. For 28 training points, no method recovers
the correct clustering structure, as displayed in Figure 2, although CN performs
slightly better than the k-means approach since the metric it learns is more diffuse.
For 50 training points, CN performs much better than the k-means approach, which
completely fails to recover the clustering structure as illustrated by the $\Sigma$ learned
for 28 and 50 training points in Figure 2. In the latter setting, CN partially recovers
the clusters. When more training points become available, the k-means approach
perfectly recovers the clustering structure and outperforms the relaxed approach.
The reprojected approach, on the other hand, always performs as well as the best
of the two other methods. The CNinit approach results are not displayed since they
are the same as for the reprojected method.
Figure 1: RMSE versus number of training points (log scale) for the tested methods.

4.2 MHC-I binding data 

We also applied our method to the IEDB MHC-I peptide binding benchmark
proposed in [10]. This database contains binding affinities of various peptides, i.e.,
short amino-acid sequences, with different MHC-I molecules. This binding process
is central in the immune system, and predicting it is crucial, for example to design
vaccines. The affinities are thresholded to give a prediction problem. Each MHC-I
molecule is considered as a task, and the goal is to predict whether a peptide binds
a molecule. We used an orthogonal coding of the amino acids to represent the
peptides and balanced the data by keeping only one negative example for each positive
point, resulting in 15236 points involving 35 different molecules. We chose a logistic
loss for $\ell(W)$.

Multi-task learning approaches have already proved useful for this problem, see 
for example [11, 12]. Besides, it is well known in the vaccine design community that 
some molecules can be grouped into empirically defined supertypes known to have 
similar binding behaviors. 

[12] showed in particular that the multi-task approaches were very useful for 
molecules with few known binders. Following this observation, we consider the 
mean error on the 10 molecules with fewer than 200 known ligands, and report the
results in Table 1. We did not select the parameters by internal cross validation, but 
chose them among a small set of values in order to avoid overfitting. More accurate 
results could arise from such a cross validation, in particular concerning the number 
of clusters (here we limited the choice to 2 or 10 clusters). 

The pooling approach simply considers one global prediction problem by pooling
together the data available for all molecules. The results illustrate that it is better
to consider individual models than one unique pooled model, even when few data
points are available. On the other hand, all the multi-task approaches improve the
accuracy, the cluster norm giving the best performance. The learned $\Sigma$, however,
did not recover the known supertypes, although it may contain some relevant
information on the binding behavior of the molecules. Finally, the reprojection methods
(reprojected and CNinit) did not improve the performance, potentially because the
learned structure was not strong enough.

Figure 2: Recovered $\Sigma$ with CN (upper line) and k-means (lower line) for 28, 50 and
100 points.

Table 1: Prediction error for the 10 molecules with fewer than 200 training peptides
in IEDB.

Method          Test error
Pooling         26.53% ± 2.0
Frobenius       11.62% ± 1.4
MT kernel       10.10% ± 1.4
Trace norm       9.20% ± 1.3
Cluster Norm     8.71% ± 1.5

5 Conclusion 

We have presented a convex approach to clustered multi-task learning, based on
the design of a dedicated norm. Promising results were presented on synthetic
examples and on the IEDB dataset. We are currently investigating more refined
convex relaxations and the natural extension to non-linear multi-task learning as
well as the inclusion of specific features on the tasks, which has been shown to
improve performance in other settings [6].